
Conversation


@kdy1 kdy1 commented Nov 20, 2025

No description provided.

@kdy1 kdy1 added this to the Planned milestone Nov 20, 2025
@kdy1 kdy1 self-assigned this Nov 20, 2025

changeset-bot bot commented Nov 20, 2025

⚠️ No Changeset found

Latest commit: 7b7f3f2

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR


claude bot commented Nov 20, 2025

Code Review: PR #11318 - Performance optimization: Make all parsers work by byte instead of char

Overview

This PR refactors the input parsing infrastructure to operate on bytes instead of chars, which is a significant performance optimization. The changes are substantial but well-structured.

Positive Aspects ✅

  1. Performance-focused design: The byte-level operations should significantly improve parsing performance, especially for ASCII-heavy code (which is the common case).

  2. Careful UTF-8 handling: The bump() implementation in crates/swc_common/src/input.rs:133-156 correctly calculates UTF-8 character lengths:

    • 1 byte for ASCII leading bytes (< 0x80)
    • 2 bytes for leading bytes < 0xE0
    • 3 bytes for leading bytes < 0xF0
    • 4 bytes for leading bytes >= 0xF0
  3. Fallback mechanism: The code properly maintains cur_as_char() for cases where full Unicode character processing is needed (identifiers, strings, etc.), showing good design thinking.

  4. Test coverage: The existing unit tests have been updated to reflect the byte-based API changes.

Issues & Concerns 🔴

1. Potential UTF-8 boundary safety issue (High Priority)

Location: crates/swc_common/src/input.rs:144-152

The UTF-8 length calculation in bump() uses a simplified heuristic that may not be robust for all invalid UTF-8 sequences:

let len = if first_byte < 0x80 {
    1 // ASCII
} else if first_byte < 0xe0 {
    2 // 2-byte UTF-8
} else if first_byte < 0xf0 {
    3 // 3-byte UTF-8
} else {
    4 // 4-byte UTF-8
};

Issue: This doesn't validate that the subsequent bytes are valid UTF-8 continuation bytes (should be in range 0x80-0xBF). While the input is presumably valid UTF-8 (since it comes from a &str), slicing at incorrect boundaries could cause panics in debug builds or undefined behavior.

Recommendation: Since the input is &str, we know it's valid UTF-8. However, add a comment explaining this assumption and consider using str::chars().next() to get the actual character length for the fast path, or add debug assertions to validate continuation bytes.
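As a sketch of the debug-assertion route, mirroring the PR's length table and adding a debug-only continuation-byte check (the function names here are hypothetical, not the PR's actual API):

```rust
/// Byte length of the UTF-8 sequence starting with `first_byte`.
/// Assumes the input is already valid UTF-8, as it comes from a `&str`.
fn utf8_len(first_byte: u8) -> usize {
    if first_byte < 0x80 {
        1 // ASCII
    } else if first_byte < 0xe0 {
        2 // 2-byte UTF-8
    } else if first_byte < 0xf0 {
        3 // 3-byte UTF-8
    } else {
        4 // 4-byte UTF-8
    }
}

/// Debug-only validation that the trailing bytes are continuation bytes
/// (0b10xx_xxxx). The loop body compiles to nothing in release builds.
fn debug_validate_utf8(bytes: &[u8], len: usize) {
    for &b in &bytes[1..len] {
        debug_assert!(b & 0xC0 == 0x80, "expected continuation byte, got {b:#04x}");
    }
}

fn main() {
    for (s, expected) in [("a", 1), ("é", 2), ("世", 3), ("🌍", 4)] {
        let bytes = s.as_bytes();
        let len = utf8_len(bytes[0]);
        debug_validate_utf8(bytes, len);
        assert_eq!(len, expected);
        assert_eq!(len, s.len()); // matches what str itself reports
    }
}
```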

2. Unsafe code without complete documentation (Medium Priority)

Locations: Multiple instances throughout the codebase

Many unsafe blocks lack detailed safety comments explaining why the operation is safe. Examples:

  • crates/swc_common/src/input.rs:71, 79, 154
  • crates/swc_ecma_lexer/src/common/lexer/mod.rs: numerous unsafe slicing operations

Recommendation: Per the project's guidelines (#1 and #7 in CLAUDE.md), add comprehensive safety documentation. For example:

// Safety: first_byte comes from self.remaining.as_bytes() which is guaranteed to be
// valid UTF-8. The length calculation follows UTF-8 encoding rules (RFC 3629).
// get_unchecked is safe because we've checked that remaining is not empty.

3. Inconsistent ASCII validation (Medium Priority)

Location: crates/swc_common/src/input.rs:160-165

fn cur_as_ascii(&self) -> Option<u8> {
    let first_byte = *self.remaining.as_bytes().first()?;
    if first_byte <= 0x7f {
        Some(first_byte)
    } else {
        None
    }
}

Issue: The check first_byte <= 0x7f should be first_byte < 0x80 for consistency with the ASCII definition (0-127). While the two are functionally equivalent, < 0x80 is more idiomatic and matches the pattern used elsewhere (line 144).

Recommendation: Change to if first_byte < 0x80 for consistency.

4. Missing boundary checks (Medium Priority)

Location: crates/swc_common/src/input.rs:84-92

The bump_one() method guards the EOF case with debug_unreachable!, which only fires in debug builds, so misuse goes undetected in release builds:

pub fn bump_one(&mut self) {
    if !self.remaining.is_empty() {
        self.remaining = unsafe { self.remaining.get_unchecked(1..) };
        self.last_pos.0 += 1;
    } else {
        unsafe {
            debug_unreachable!("bump should not be called when cur() == None");
        }
    }
}

Issue: In release builds debug_unreachable! expands to an unreachable hint (unreachable_unchecked()), so reaching the else branch at EOF is undefined behavior rather than a harmless no-op.

Recommendation: Document the contract that bump_one() must never be called at EOF, and add a debug assertion that the current byte is ASCII, since advancing a single byte is only correct for single-byte characters.
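One possible shape, sketched on a simplified stand-in for the real StringInput, with the else branch replaced by explicit debug assertions:

```rust
struct Input<'a> {
    remaining: &'a str,
    last_pos: u32,
}

impl<'a> Input<'a> {
    /// Advance past one byte. Caller contract: the input is non-empty and
    /// the current byte is ASCII (so one byte is a full character).
    pub fn bump_one(&mut self) {
        debug_assert!(!self.remaining.is_empty(), "bump_one called at EOF");
        debug_assert!(
            self.remaining.as_bytes()[0] < 0x80,
            "bump_one called on a non-ASCII byte"
        );
        // An ASCII byte is always a char boundary, so safe slicing suffices
        // here; the real code uses `get_unchecked` for speed.
        self.remaining = &self.remaining[1..];
        self.last_pos += 1;
    }
}

fn main() {
    let mut i = Input { remaining: "ab", last_pos: 0 };
    i.bump_one();
    assert_eq!(i.remaining, "b");
    assert_eq!(i.last_pos, 1);
}
```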

5. Complex escape sequence handling (Low Priority)

Location: crates/swc_css_parser/src/lexer/mod.rs:709-787

The string reading logic has become more complex with the byte-based approach, particularly around escape sequences where we need to convert bytes back to chars:

let cur_byte = l.input.cur();
let cur_char = if let Some(b) = cur_byte {
    if is_non_ascii(b) {
        l.input.cur_as_char()
    } else {
        Some(b as char)
    }
} else {
    None
};

Recommendation: This is correct but verbose. Consider extracting this pattern into a helper method like cur_as_char_for_string() to reduce duplication and improve readability.
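A possible shape for that helper, sketched here on a plain &str rather than the real lexer input type (cur_as_char_for_string is a hypothetical name):

```rust
/// Decode the character at the front of `s`, taking the cheap path for ASCII.
/// Stand-in for a `cur_as_char_for_string()` method on the lexer input.
fn cur_as_char_for_string(s: &str) -> Option<char> {
    let b = *s.as_bytes().first()?;
    if b < 0x80 {
        Some(b as char) // ASCII fast path, no UTF-8 decoding
    } else {
        s.chars().next() // full multi-byte decode
    }
}

fn main() {
    assert_eq!(cur_as_char_for_string("abc"), Some('a'));
    assert_eq!(cur_as_char_for_string("世界"), Some('世'));
    assert_eq!(cur_as_char_for_string(""), None);
}
```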

Performance Considerations ⚡

  1. Excellent byte-search optimization: The use of byte_search! macros with lookup tables (lines 51-66 in lexer/mod.rs) for scanning is an excellent optimization pattern.

  2. Smart fast-path design: The code correctly identifies that ASCII is the common case and optimizes for it while falling back to slower Unicode handling when needed.

  3. Reduced allocations: Operating on byte slices rather than iterating over chars should reduce allocations and improve cache locality.

Testing Recommendations 🧪

Per CLAUDE.md guideline #5, please ensure:

  1. Add tests for edge cases:

    • Multi-byte UTF-8 sequences (2, 3, and 4-byte sequences)
    • Strings containing emoji and other non-BMP characters
    • Mixed ASCII and Unicode content
    • Strings with escape sequences followed by Unicode
  2. Add benchmarks comparing before/after performance, especially for:

    • Large ASCII files
    • Files with mixed ASCII/Unicode
    • Files with many escape sequences

Example test to add:

#[test]
fn test_multibyte_utf8() {
    with_test_sess("hello 世界 🌍", |mut i| {
        // Test bump() correctly handles multi-byte sequences
        assert_eq!(i.cur(), Some(b'h'));
        // ... test full sequence
    });
}

Documentation 📚

Per CLAUDE.md guideline #7, please add:

  1. Module-level documentation explaining the byte-based approach and when to use cur_as_char()
  2. Safety invariants for the Input trait
  3. Examples showing proper usage patterns

Minor Issues

  1. Formatting: Remember to run cargo fmt --all before committing (CLAUDE.md #8)

  2. Code comment clarity: Some comments could be more precise:

    • Line 13 in input.rs: "Remaining input as str" could be "Remaining unparsed input slice"

Summary

This is a well-thought-out performance optimization with good architectural decisions. The main concerns are:

  • Must fix: UTF-8 boundary handling needs more robust validation or better documentation of safety assumptions
  • Should fix: Unsafe code needs comprehensive safety documentation
  • Should fix: Add comprehensive tests for multi-byte UTF-8 sequences
  • Nice to have: Minor consistency improvements and helper methods for common patterns

The performance benefits should be significant, especially for typical JavaScript/TypeScript code which is mostly ASCII. Great work on maintaining the fast-path optimization pattern!

Overall Assessment: Approve with requested changes ✅ (pending fixes for UTF-8 safety and documentation)

🤖 Generated with Claude Code


github-actions bot commented Nov 20, 2025

Binary Sizes

File                    Size
swc.linux-x64-gnu.node  31M (31932360 bytes)

Commit: e086996


codspeed-hq bot commented Nov 20, 2025

CodSpeed Performance Report

Merging #11318 will degrade performances by 3.72%

Comparing kdy1/parser-by-byte (7b7f3f2) with main (be86fad)

Summary

⚡ 3 improvements
❌ 5 regressions
✅ 130 untouched

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark                                 BASE      HEAD      Change
css/lexer/bootstrap_5_1_3                 11 ms     11.4 ms   -3.72%
css/lexer/foundation_6_7_4                9 ms      9.3 ms    -3.43%
css/lexer/tailwind_3_1_1                  1.8 ms    1.8 ms    -3.28%
es/lexer/cal-com                          13 ms     13.5 ms   -3.21%
es/lexer/typescript                       106.9 ms  109.9 ms  -2.73%
html/lexer/css_2021_spec                  40 ms     38.9 ms   +2.93%
html/lexer/github_com_17_05_2022          14.4 ms   14.1 ms   +2.34%
html/lexer/stackoverflow_com_17_05_2022   14.5 ms   14.1 ms   +2.68%


claude bot commented Nov 20, 2025

Code Review for PR #11318: perf(parser): Make all parsers work by byte instead of char

I've reviewed this PR which converts the parsers from character-based to byte-based processing for improved performance. Here's my detailed feedback:

✅ Strengths

  1. Well-structured refactoring: The conversion from char to u8 is systematic and consistent across all parsers (CSS, ECMAScript, HTML, XML).

  2. Smart hybrid approach: The code correctly distinguishes between:

    • Fast path: ASCII bytes (< 0x80) processed directly as bytes
    • Slow path: Non-ASCII bytes where cur_as_char() is called to get full UTF-8 characters
  3. Correct UTF-8 byte length calculation: The bump() implementation in crates/swc_common/src/input.rs:143-152 correctly calculates UTF-8 character lengths:

    let len = if first_byte < 0x80 {
        1 // ASCII
    } else if first_byte < 0xe0 {
        2 // 2-byte UTF-8
    } else if first_byte < 0xf0 {
        3 // 3-byte UTF-8
    } else {
        4 // 4-byte UTF-8
    };

    This matches the UTF-8 specification correctly.

  4. Performance optimization: The new bump_bytes(n) method allows efficient advancement when the byte count is already known, avoiding redundant character boundary calculations.

  5. Proper bump_one() usage: The method is correctly restricted to single-byte ASCII operations (see crates/swc_ecma_parser/src/lexer/table.rs:346).

⚠️ Potential Issues & Concerns

1. Invalid UTF-8 handling (Medium severity)

The UTF-8 byte length calculation assumes valid UTF-8 input. For bytes that can never begin a valid sequence, it computes lengths anyway:

  • 0x80-0xBF: continuation bytes, but treated as starting a 2-byte sequence
  • 0xC0, 0xC1: invalid overlong encodings, but treated as 2-byte
  • 0xF5-0xFF: beyond the Unicode range, but treated as 4-byte

Recommendation: Since parsers typically receive valid UTF-8 from Rust strings, this is acceptable. However, consider adding a debug assertion or documentation comment noting this assumption.

Location: crates/swc_common/src/input.rs:143-152

2. Missing UTF-8 boundary validation in bump_bytes() (Low severity)

The bump_bytes() method has a debug assertion for length but doesn't verify UTF-8 character boundaries:

pub fn bump_bytes(&mut self, n: usize) {
    debug_assert!(n <= self.remaining.len());
    self.remaining = unsafe { self.remaining.get_unchecked(n..) };
    self.last_pos.0 += n as u32;
}

Recommendation: Add a debug assertion or documentation comment that callers must ensure n aligns with UTF-8 character boundaries.
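For example, str::is_char_boundary gives a cheap debug-only check (the Input struct below is a simplified stand-in for the real type):

```rust
struct Input<'a> {
    remaining: &'a str,
    last_pos: u32,
}

impl<'a> Input<'a> {
    pub fn bump_bytes(&mut self, n: usize) {
        debug_assert!(n <= self.remaining.len());
        // Debug-only: `n` must land on a UTF-8 character boundary, or
        // `remaining` would no longer be a valid `&str`.
        debug_assert!(self.remaining.is_char_boundary(n));
        // The real code uses `get_unchecked(n..)`; safe slicing shown here.
        self.remaining = &self.remaining[n..];
        self.last_pos += n as u32;
    }
}

fn main() {
    let mut i = Input { remaining: "世界!", last_pos: 0 };
    i.bump_bytes(3); // '世' is a 3-byte sequence
    assert_eq!(i.remaining, "界!");
    assert_eq!(i.last_pos, 3);
}
```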

Location: crates/swc_common/src/input.rs:76-80

3. Character conversion pattern in CSS/HTML parsers (Low severity)

Throughout the CSS and HTML lexers, there's a repetitive pattern:

let ch = if is_non_ascii(c) {
    l.input.cur_as_char()
} else {
    Some(c as char)
};

This is done before consuming the byte. While correct, it adds cognitive overhead.

Recommendation: Consider creating a helper method like cur_as_full_char() that encapsulates this pattern. Example:

impl Input {
    fn cur_as_full_char(&self) -> Option<char> {
        match self.cur()? {
            c if c < 0x80 => Some(c as char),
            _ => self.cur_as_char()
        }
    }
}

Locations: Multiple occurrences in crates/swc_css_parser/src/lexer/mod.rs and crates/swc_html_parser/src/lexer/mod.rs

4. Test coverage (Medium severity)

I notice the PR has a test failure ("Test with @swc/cli" failed). Additionally:

  • No new unit tests specifically for the byte-based Input trait changes
  • Existing tests in crates/swc_common/src/input.rs:392-425 were updated but not expanded

Recommendation:

  • Fix the failing test
  • Add unit tests covering:
    • Multi-byte UTF-8 characters (2, 3, and 4 bytes)
    • Mixed ASCII and non-ASCII text
    • bump_bytes() with various byte counts
    • Edge cases with special Unicode characters (emoji, etc.)

5. Performance verification (Low severity)

While the change should improve performance, I don't see benchmark results in the PR description.

Recommendation: Run the existing benchmarks and include results showing the performance improvement this change provides.

📋 Code Quality

  1. Comments: The code has good inline comments explaining the byte-based approach
  2. Follows CLAUDE.md: ✅ The code avoids unstable features, has comments in English, and appears performant
  3. Documentation: ⚠️ The trait methods could use more doc comments explaining the byte vs. char distinction

🔒 Security

No security concerns identified. The unsafe code blocks are properly justified and bounded by debug assertions.

📊 Summary

This is a solid performance optimization with generally good implementation. The main concerns are:

  1. Fix the failing test
  2. Add more comprehensive test coverage for UTF-8 handling
  3. Consider the recommendations for invalid UTF-8 and helper methods

The core algorithm is correct, and the refactoring is well-executed. Once the test failure is resolved and coverage is improved, this should be ready to merge.


Review generated by Claude Code 🤖

kdy1 and others added 8 commits November 25, 2025 15:05
Updated test assertions in `swc_common/src/input.rs` to use byte literals (b'x') instead of char literals ('x') to match the new byte-based Input trait signature.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Replaced all char literals with byte literals in the lexer to match the new byte-based Input trait.

Key changes:
- Changed char literals to byte literals: 'x' → b'x'
- Fixed debug assertions in lexer methods
- Updated comparisons in peek() and cur() checks
- For non-ASCII Unicode characters, used cur_as_char() to get full UTF-8 character
- Converted bytes to chars where needed for digit/identifier checks

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Remove unnecessary `as u8` cast in ECMAScript lexer (clippy error)
- Fix XML parser to use `cur_as_char()` and byte literals for char comparisons
- Fix HTML parser current_char usage (already in staged changes)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…sion

This fixes issues where multibyte UTF-8 characters (like '⬇' and '😀') were being corrupted during parsing after the char-to-byte conversion. The problem was that individual UTF-8 bytes were being cast directly to char using `as char`, which treats each byte as a separate Unicode code point.

**Changes made:**

1. **CSS Parser** (`swc_css_parser`):
   - Updated string tokenization to use `Input::cur_as_char()` for non-ASCII bytes
   - Fixed URL tokenization to properly decode multibyte UTF-8 sequences
   - Fixed bad URL remnant parsing

2. **HTML Parser** (`swc_html_parser`):
   - Removed truly unused functions (`is_upper_hex_digit_char`, `is_lower_hex_digit_char`, `is_ascii_hex_digit_char`, `is_ascii_lower_alpha_char`, `is_ascii_alpha_char`, `append_to_doctype_token`)
   - Kept `is_ascii_upper_alpha_char` which is still in use
   - HTML parser already had correct UTF-8 handling via `current_char` field

**Implementation:**
Before consuming bytes, we now check if it's non-ASCII (`>= 0x80`) and call `cur_as_char()` to get the full UTF-8 character before advancing the input stream. For ASCII bytes, we continue using the fast path of `as char`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@kdy1 kdy1 force-pushed the kdy1/parser-by-byte branch from 18bf7dd to f67c3c0 on November 25, 2025 at 06:05

claude bot commented Nov 25, 2025

Code Review: Performance optimization - byte-based parsing

Summary

This PR refactors the parser infrastructure to operate on bytes instead of chars, which should provide significant performance improvements. The changes are well-structured and follow a consistent pattern across all parsers (ECMAScript, CSS, HTML, XML).

✅ Strengths

1. Performance-focused design (aligns with CLAUDE.md #1)

  • Moving from char iteration to byte slicing is a solid optimization
  • The UTF-8 length calculation in input.rs:144-152 is efficient and correct
  • ASCII fast-path optimization in input.rs:194-198 is excellent:
    if c < 0x80 {
        self.input.bump_bytes(1);
    } else {
        unsafe { self.input.bump(); }
    }

2. Correct UTF-8 handling

  • The UTF-8 decoding logic correctly identifies character boundaries using the leading byte
  • Proper fallback to cur_as_char() for non-ASCII characters requiring full char processing
  • The is_non_ascii(c: u8) helper (c >= 0x80) is correct

3. API design improvements

  • Clear separation of concerns with cur() → byte, cur_as_char() → char
  • Added bump_bytes(n) for efficient multi-byte advancement
  • Well-documented with clear safety contracts

4. Good test coverage

  • Existing tests updated to work with bytes (b'o' instead of 'o')
  • Tests verify correct behavior after the refactor

⚠️ Issues & Concerns

1. Critical: Incomplete UTF-8 validation in bump() (input.rs:144-152)

The UTF-8 length calculation assumes well-formed UTF-8 but doesn't validate:

let len = if first_byte < 0x80 {
    1 // ASCII
} else if first_byte < 0xe0 {
    2 // 2-byte UTF-8
} else if first_byte < 0xf0 {
    3 // 3-byte UTF-8
} else {
    4 // 4-byte UTF-8
};

Problems:

  • Doesn't check for invalid UTF-8 sequences (e.g., 0xC0, 0xC1, 0xF5-0xFF)
  • Doesn't validate continuation bytes (0x80-0xBF)
  • Could read past valid UTF-8 boundaries on malformed input
  • Potential security issue if parsing untrusted input

Recommendation:
Either:

  1. Add validation (slower but safer):
    let len = if first_byte < 0x80 {
        1
    } else if first_byte < 0xC2 {
        // Invalid: overlong encoding or continuation byte
        return Err(...);
    } else if first_byte < 0xe0 {
        // Validate continuation byte
        if bytes.get(1).map_or(false, |&b| b & 0xC0 == 0x80) { 2 } else { return Err(...); }
    }
    // ... etc
  2. Or document that the input must be valid UTF-8 (since Rust strings are UTF-8, this is likely safe)

Since the input comes from Rust &str which is guaranteed valid UTF-8, this is probably safe. However, add a comment explaining this assumption:

// Safety: Input is &str which is guaranteed valid UTF-8 by Rust.
// We can safely decode the UTF-8 length from the first byte without validation.

2. Potential issue: bump_one() at input.rs:85-86

if !self.remaining.is_empty() {
    self.remaining = unsafe { self.remaining.get_unchecked(1..) };

This assumes the first byte is a complete character. This is incorrect for multi-byte UTF-8 sequences. If called on a multi-byte character, it will:

  • Break on the next character boundary check
  • Potentially cause issues with position tracking

Questions:

  • Is bump_one() only called for ASCII bytes?
  • If yes, add an assertion: debug_assert!(self.remaining.as_bytes()[0] < 0x80);
  • If no, this needs to be fixed to use the UTF-8 length calculation from bump()

3. CSS Parser: Mixed byte/char handling (swc_css_parser/src/lexer/mod.rs)

In multiple places, there's inconsistent handling:

let cur_char = if let Some(b) = cur_byte {
    if is_non_ascii(b) {
        l.input.cur_as_char()
    } else {
        Some(b as char)
    }
} else {
    None
};

This pattern is repeated many times (lines 735, 832, 850, etc.). Consider extracting to a helper:

fn byte_to_char(&self, b: u8) -> char {
    if b < 0x80 {
        b as char
    } else {
        self.input.cur_as_char().expect("non-ASCII byte should form valid UTF-8")
    }
}

4. Documentation gaps

  • The bump_bytes() function's safety requirements aren't clear enough. Add:

    /// # Panics (in debug mode)
    /// Panics if `n > self.remaining.len()`
    /// 
    /// # Safety
    /// For UTF-8 inputs, caller must ensure `n` aligns with character boundaries
    
  • The Input trait should document that implementors must handle UTF-8 correctly

5. Code style: Unnecessary casting (aligns with CLAUDE.md #1)

Multiple instances of c as char where c is a byte:

Ok(Token::Delim { value: b'#' as char })  // Line 324
Ok(Token::Delim { value: b'@' as char })  // Line 443

This is fine, but consider if there's a more efficient representation that doesn't require casting at runtime.

6. Potential performance regression in consume() (swc_css_parser/src/lexer/mod.rs:194-234)

The optimization added:

if c < 0x80 {
    self.input.bump_bytes(1);
} else {
    unsafe { self.input.bump(); }
}

This adds a branch for every character. While likely faster overall, consider:

  • Benchmarking to verify the performance improvement
  • The branch might be mispredicted on mixed ASCII/non-ASCII input

🔍 Security Considerations

  1. UTF-8 boundary safety: The bump() implementation assumes valid UTF-8. This is safe for Rust strings but could be problematic if the code is ever adapted for other inputs.

  2. Unsafe slicing: Multiple uses of get_unchecked(). These appear correct given the debug_assert! checks, but:

    • Ensure all code paths maintain the invariants
    • Consider fuzzing to catch boundary issues

📝 Minor Issues

  1. Line 78: debug_assert!(n <= self.remaining.len()); - Good! But also document this in the function doc comment.

  2. Inconsistent patterns: Some parsers check is_non_ascii(b) before calling cur_as_char(), others don't. Standardize this pattern.

  3. Test coverage: The existing tests were updated, but consider adding:

    • Tests with multi-byte UTF-8 characters at various positions
    • Tests with characters at the edge of different UTF-8 byte lengths
    • Boundary tests (e.g., consuming exactly to EOF with multi-byte chars)
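A sketch of such tests, using plain &str operations as a stand-in for the Input trait (the characters come from the commit messages above):

```rust
fn main() {
    // 1-byte, 3-byte, and 4-byte characters back to back.
    let src = "a⬇😀";
    let bytes = src.as_bytes();

    // cur() would return the raw first byte at each position.
    assert_eq!(bytes[0], b'a');
    assert!(bytes[1] >= 0x80); // leading byte of '⬇' is non-ASCII

    // Character boundaries are at the expected byte offsets.
    let boundaries: Vec<usize> = src.char_indices().map(|(i, _)| i).collect();
    assert_eq!(boundaries, vec![0, 1, 4]); // '⬇' is 3 bytes, '😀' is 4

    // Consuming exactly to EOF lands on a valid boundary.
    assert!(src.is_char_boundary(src.len()));
    assert_eq!(src.len(), 8);
}
```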

🎯 Recommendations

High Priority

  1. ✅ Add comment explaining UTF-8 safety assumption in bump()
  2. ⚠️ Fix or document bump_one() - clarify its ASCII-only use case
  3. 📊 Run benchmarks to quantify performance improvements
  4. 🧪 Add unit tests for multi-byte UTF-8 handling

Medium Priority

  1. Extract repeated byte-to-char conversion pattern into helper
  2. Improve documentation for bump_bytes() safety requirements
  3. Audit all get_unchecked() uses with careful review

Low Priority

  1. Consider if Token::Delim could store u8 instead of char for ASCII delimiters
  2. Add property-based tests for UTF-8 edge cases

✨ Conclusion

This is a high-quality performance optimization that follows good Rust practices. The core concept is sound, and the implementation is mostly correct. The main concerns are:

  1. Clarifying the UTF-8 safety assumptions
  2. Verifying bump_one() is correct for its use cases
  3. Measuring the actual performance impact

After addressing the UTF-8 documentation and bump_one() issues, this should be good to merge. Great work on maintaining test coverage and following the project's performance-first philosophy!


Reviewed following CLAUDE.md guidelines: performance-focused, English documentation, no unstable features, unit tests included.

@kdy1 kdy1 force-pushed the kdy1/parser-by-byte branch from f67c3c0 to 7b7f3f2 on November 25, 2025 at 06:56

claude bot commented Nov 25, 2025

Code Review - PR #11318: Make all parsers work by byte instead of char

I've reviewed this PR which refactors parsers to work with bytes instead of chars for performance improvements. Here's my feedback:

Strengths

  1. Performance-focused design: The change aligns with CLAUDE.md rule #1 ("Write performant code"). Working with bytes is more efficient than char iteration for ASCII-heavy parsing.

  2. Well-structured refactoring: The core change to Input trait properly distinguishes between byte operations (cur(), peek()) and character operations (cur_as_char()).

  3. UTF-8 handling in bump(): The implementation at swc_common/src/input.rs:143-152 correctly calculates UTF-8 character lengths using standard UTF-8 leading byte patterns.

  4. Comprehensive scope: Changes are consistently applied across all parsers (ECMAScript, CSS, HTML, XML).

⚠️ Issues & Concerns

1. CRITICAL: UTF-8 Validation Missing in bump()

Location: crates/swc_common/src/input.rs:143-152

The bump() function calculates UTF-8 character length based solely on the first byte without validating that the subsequent bytes are valid UTF-8 continuation bytes:

let len = if first_byte < 0x80 {
    1 // ASCII
} else if first_byte < 0xe0 {
    2 // 2-byte UTF-8
} else if first_byte < 0xf0 {
    3 // 3-byte UTF-8
} else {
    4 // 4-byte UTF-8
};

Problem: If the input contains invalid UTF-8 (e.g., malformed sequences), this could:

  • Skip incorrect number of bytes
  • Cause position tracking corruption
  • Potentially panic on subsequent get_unchecked() calls with out-of-bounds indices

Recommendation: Add validation or document that input is guaranteed to be valid UTF-8. Since StringInput uses &str which is guaranteed valid UTF-8 by Rust, this should be safe, but a comment explaining this invariant would improve clarity.

2. Simplified UTF-8 Length Calculation

The current implementation uses a series of if statements. On nightly it could be simplified to:

let len = std::str::utf8_char_width(first_byte);

but utf8_char_width is gated behind the unstable str_internals feature, so given the project's no-unstable-features rule, keep the current implementation and add a comment explaining the choice:

// Fast UTF-8 length calculation based on leading byte
// Safe because &str guarantees valid UTF-8

3. Mixed Byte/Char Handling in CSS Parser

Location: crates/swc_css_parser/src/lexer/mod.rs:726-736, 798-800

The code retrieves both byte and char representations:

let cur_byte = l.input.cur();
let cur_char = if let Some(b) = cur_byte {
    if is_non_ascii(b) {
        l.input.cur_as_char()
    } else {
        Some(b as char)
    }
} else {
    None
};

Issue: This pattern appears multiple times but cur_char is only used in the Some(_) wildcard case. This suggests the byte-to-char conversion logic could be cleaner.

Recommendation: Consider refactoring to only call cur_as_char() when needed, or add a helper function for this common pattern.

4. Unsafe Code Usage

Multiple uses of unsafe { get_unchecked() } throughout:

  • input.rs:71, 79, 86, 154, 201, 244, 265
  • All uses appear justified with proper bounds checking

Recommendation: All unsafe blocks have adequate debug_assert! guards, but consider adding safety comments explaining the invariants, especially for bump_bytes() at line 77-80.

5. Test Coverage

Observations:

  • Unit tests in input.rs updated to use bytes (b'o', b'/', etc.)
  • Tests verify basic functionality (slice, reset_to, bump)

Gaps:

  • No explicit tests for UTF-8 multi-byte character handling
  • No tests for edge cases with non-ASCII characters
  • No performance benchmarks included (though CI shows pending benchmark jobs)

Recommendation: Add test cases covering:

#[test]
fn src_input_multibyte_utf8() {
    with_test_sess("hello世界", |mut i| {
        // Test bump() correctly handles multi-byte UTF-8
        // Test cur() returns first byte of multi-byte char
        // Test cur_as_char() returns full character
    });
}

6. API Consistency

The trait now has:

  • cur() → Option<u8> (byte)
  • cur_as_ascii() → Option<u8> (byte, only if ASCII)
  • cur_as_char() → Option<char> (character)

This is good separation, but documentation could be clearer about when to use each method.

📝 Minor Issues

  1. Line 144-152: Consider extracting UTF-8 length calculation to a separate const fn for reusability and testing.

  2. Documentation: The trait methods have good docs, but the bump() implementation could use inline comments explaining the UTF-8 encoding scheme.

  3. Performance: The refactoring should show measurable improvements, but I see benchmark jobs are still pending in CI.
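The const fn extraction suggested in item 1 might look like this (utf8_byte_len is a hypothetical name; the body mirrors the PR's logic):

```rust
/// Byte length of a UTF-8 sequence, derived from its leading byte.
/// Valid only for leading bytes of well-formed UTF-8, which `&str` guarantees.
const fn utf8_byte_len(first_byte: u8) -> usize {
    if first_byte < 0x80 {
        1
    } else if first_byte < 0xe0 {
        2
    } else if first_byte < 0xf0 {
        3
    } else {
        4
    }
}

// Being `const` means the table can also be checked at compile time.
const _: () = assert!(utf8_byte_len(b'a') == 1);

fn main() {
    assert_eq!(utf8_byte_len("é".as_bytes()[0]), 2);
    assert_eq!(utf8_byte_len("世".as_bytes()[0]), 3);
    assert_eq!(utf8_byte_len("🌍".as_bytes()[0]), 4);
}
```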

🔒 Security

No security concerns identified. The refactoring maintains safety through:

  • Rust's &str UTF-8 guarantees
  • Proper bounds checking with debug assertions
  • No new attack surface introduced

Adherence to CLAUDE.md

  • ✅ Rule 1: Performance-focused (main goal)
  • ✅ Rule 2: English comments/docs
  • ✅ Rule 3: No unstable features used
  • ✅ Rule 5: Unit tests present
  • ⚠️ Rule 6: Tests updated but not removed (good)
  • ⚠️ Rule 7: Could use more inline documentation
  • ❓ Rule 8: cargo fmt likely run (formatting looks consistent)

📊 Overall Assessment

This is a solid refactoring with clear performance benefits. The core idea is sound and implementation is mostly correct. Main concerns:

  1. Add validation explanation/comments for UTF-8 handling in bump()
  2. Add tests for multi-byte UTF-8 characters
  3. Consider extracting common byte-to-char conversion pattern in CSS parser

Recommendation: Approve with minor changes

The changes are safe due to Rust's UTF-8 guarantees, but adding the suggested documentation and tests would significantly improve confidence and maintainability.


Review generated by Claude Code
